Reference-guided assembly of four diverse Arabidopsis thaliana genomes.

Authors

  • Korbinian Schneeberger
  • Stephan Ossowski
  • Felix Ott
  • Juliane D Klein
  • Xi Wang
  • Christa Lanz
  • Lisa M Smith
  • Jun Cao
  • Joffrey Fitz
  • Norman Warthmann
  • Stefan R Henz
  • Daniel H Huson
  • Detlef Weigel
Abstract

We present whole-genome assemblies of four divergent Arabidopsis thaliana strains that complement the 125-Mb reference genome sequence released a decade ago. Using a newly developed reference-guided approach, we assembled large contigs from 9 to 42 Gb of Illumina short-read data from the Landsberg erecta (Ler-1), C24, Bur-0, and Kro-0 strains, which have been sequenced as part of the 1,001 Genomes Project for this species. Using alignments against the reference sequence, we first reduced the complexity of the de novo assembly and later integrated reads without similarity to the reference sequence. As an example, half of the noncentromeric C24 genome was covered by scaffolds that are longer than 260 kb, with a maximum of 2.2 Mb. Moreover, over 96% of the reference genome was covered by the reference-guided assembly, compared with only 87% with a complete de novo assembly. Comparisons with 2 Mb of dideoxy sequence reveal that the per-base error rate of the reference-guided assemblies was below 1 in 10,000. Our assemblies provide a detailed, genomewide picture of large-scale differences between A. thaliana individuals, most of which are difficult to access with alignment-consensus methods only. We demonstrate their practical relevance in studying the expression differences of polymorphic genes and show how the analysis of sRNA sequencing data can lead to erroneous conclusions if aligned against the reference genome alone. Genome assemblies, raw reads, and further information are accessible through http://1001genomes.org/projects/assemblies.html.
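The scaffold statistic quoted above ("half of the noncentromeric C24 genome was covered by scaffolds that are longer than 260 kb") is an NG50-style measure. The following is a minimal sketch of how such a value could be computed, assuming non-overlapping scaffolds and a known target genome size. The function name `ng50` and the toy numbers are hypothetical illustrations, not the authors' code or data.

```python
def ng50(scaffold_lengths, genome_size):
    """Return the scaffold length L such that scaffolds of length >= L
    together span at least half of genome_size.

    Assumes scaffolds do not overlap. Returns None if the scaffolds
    never reach half of the genome size.
    """
    covered = 0
    # Walk scaffolds from longest to shortest, accumulating covered bases.
    for length in sorted(scaffold_lengths, reverse=True):
        covered += length
        if covered * 2 >= genome_size:
            return length
    return None


# Toy values for illustration only (not taken from the paper).
toy_scaffolds = [500, 400, 300, 200, 100]
print(ng50(toy_scaffolds, genome_size=2000))
```

With the toy input, the three longest scaffolds (500 + 400 + 300 = 1200) are the first to span at least half of the 2000-base target, so the reported value is the length of the third scaffold.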


Similar resources

AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references

MOTIVATION De novo assemblies of genomes remain one of the most challenging applications in next-generation sequencing. Usually, their results are incomplete and fragmented into hundreds of contigs. Repeats in genomes and sequencing errors are the main reasons for these complications. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them ...


Differential Expression of Arabidopsis thaliana Acid Phosphatases in Response to Abiotic Stresses

The objective of this research is to identify Arabidopsis thaliana genes encoding acid phosphatases induced by phosphate starvation. Multiple alignments of eukaryotic acid phosphatase amino acid sequences led to the classification of these proteins into four groups, including purple acid phosphatases (PAPs). Specific degenerate primers were designed based on conserved sequences of PAPs isol...


Dissecting a Hidden Gene Duplication: The Arabidopsis thaliana SEC10 Locus

Repetitive sequences present a challenge for genome sequence assembly, and highly similar segmental duplications may disappear from assembled genome sequences. Having found a surprising lack of observable phenotypic deviations and non-Mendelian segregation in Arabidopsis thaliana mutants in SEC10, a gene encoding a core subunit of the exocyst tethering complex, we examined whether this could be...


LOCAS – A Low Coverage Assembly Tool for Resequencing Projects

MOTIVATION Next Generation Sequencing (NGS) is a frequently applied approach to detect sequence variations between highly related genomes. Recent large-scale re-sequencing studies such as the Human 1000 Genomes Project use low-coverage NGS data to make sequencing of hundreds of individuals affordable. Here, SNPs and micro-indels can be detected by applying an alignment-consensus approach. However, com...


Yeast Two Hybrid cDNA Screening of Arabidopsis thaliana for SETH4 Protein Interaction

The SETH4 coding sequence (2013 bp) belongs to a gene family expressed in gametophytic tissues of Arabidopsis thaliana. This fragment was PCR-amplified using KOD Hi-Fi DNA polymerase and cloned into the pGBKT7 bait vector. Transformed E. coli DH5α cells containing the vector were selected on LB medium containing kanamycin. Finally, the pGBKT7-SETH4 bait was transformed into yeast ...



Journal:
  • Proceedings of the National Academy of Sciences of the United States of America

Volume 108, Issue 25

Pages -

Publication date 2011